Exploiting Sophisticated Representations for Document Retrieval
Abstract
The use of NLP techniques for document classification has not produced significant improvements in performance within the standard term-weighting statistical assignment paradigm (Fagan, 1987; Lewis, 1992b,c; Buckley, 1993). This perplexing fact needs both an explanation and a solution if the power of recently developed NLP techniques is to be successfully applied in IR. A novel method for adding linguistic annotation to corpora is presented. It uses a statistical POS tagger in conjunction with unsupervised structure-finding methods to derive notions of "noun group", "verb group", and so on. The method is inherently extensible to more sophisticated annotation and does not require a pre-tagged corpus to fit. One distinguishing feature of a linguistically sophisticated representation of documents, compared with a word-set representation, is that linguistically sophisticated units are more often individually good predictors of document descriptors (keywords) than single words are. This leads us to consider assigning descriptors from individual phrases rather than from the weighted sum of a word-set representation. We investigate how sets of individually high-precision rules can yield low precision when used together, and develop some theory about these probably-correct rules. We then reproduce results showing that standard statistical models are not particularly suitable for exploiting linguistically sophisticated representations, and show that a statistically fitted rule-based model provides significantly improved performance for sophisticated representations. This shows that statistical systems can exploit sophisticated representations of documents, and lends some support to the use of more linguistically sophisticated representations for document classification. This paper reports on work done for the LRE project SmTA, which is creating a PC-based tool to be used in the technical abstracting industry.
1 Models and Representations
First, I discuss the general paradigm for document classification, along with the notational conventions used throughout this document. We have a set of documents, $\{x_i\}$, and a set of descriptors, $\{d_i\}$. Each document is represented in one or more ways in some domain, usually as a set. The elements of this set will be called diagnostic units or predicates, $\{w_i\}$ or $\{\phi_i\}$. These diagnostic units might be the words comprising the document, or more linguistically sophisticated annotations of parts of the document. They may, in general, be predicates over documents. The representation of a document by diagnostic units will be called the DU-representation of the document and, for a document $x$, will be denoted $T(x)$. From the DU-representation of the documents, one or more descriptors are assigned to each of them by some automatic system. This paradigm of description is applicable to much of the work on text classification (and to other fields in information retrieval). This paper assesses the utility of using linguistically sophisticated diagnostic units together with a slightly non-standard statistical assignment model in order to assign descriptors to a document.
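The paradigm above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the system described in the paper: diagnostic units are modelled as plain strings, adjacent word pairs stand in for linguistically sophisticated units, and the names `word_units`, `bigram_units`, `assign_descriptors` and the example rules are all hypothetical.

```python
from typing import Dict, FrozenSet, Set

# Assumption: a diagnostic unit (DU) is modelled as a plain string.
DU = str


def word_units(text: str) -> FrozenSet[DU]:
    """Word-set DU-representation: the lower-cased words of the document."""
    return frozenset(text.lower().split())


def bigram_units(text: str) -> FrozenSet[DU]:
    """Crude stand-in for phrase-level units: adjacent word pairs."""
    words = text.lower().split()
    return frozenset(f"{a} {b}" for a, b in zip(words, words[1:]))


def assign_descriptors(units: FrozenSet[DU], rules: Dict[DU, str]) -> Set[str]:
    """Assign every descriptor whose triggering DU occurs in the representation."""
    return {rules[u] for u in units if u in rules}


# Hypothetical document and hypothetical DU -> descriptor rules.
doc = "Statistical tagger for noun group detection"
rules = {"noun group": "parsing", "tagger": "pos-tagging"}

# The DU-representation of the document: words plus word pairs.
units = word_units(doc) | bigram_units(doc)
print(sorted(assign_descriptors(units, rules)))  # ['parsing', 'pos-tagging']
```

With the word-set representation alone, only the `tagger` rule fires; the phrase-level unit `noun group` is available only in the richer representation, which is the kind of difference the paper sets out to exploit.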
Similar resources
Aggregation-Based Structured Text Retrieval
DEFINITION Text retrieval is concerned with the retrieval of documents in response to user queries. This is achieved by (i) representing documents and queries with indexing features that provide a characterisation of their information content, and (ii) defining a function that uses these representations to perform retrieval. Structured text retrieval introduces a finer-grained retrieval paradig...
Intelligent Retrieval of Hypermedia Documents
Intelligent retrieval of hypermedia documents requires sophisticated document representations and querying facilities that allow for content-based and fact-based querying as well as considering the structure of documents. This paper describes POOL, a Probabilistic Object-Oriented four-valued Logic, which allows a uniform view on hypermedia documents for the purpose of their retrieval: documents...
Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback
Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. In this paper, an approach for document image retrieval based on keyword spotting has been proposed. In the proposed method, a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method, is used in this paper to imp...
RMIT @ TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback
The TREC Dynamic Domain search task addresses search scenarios where users engage interactively with search systems to tackle domain specific information needs. In our participation, we focused on utilizing passage-based representations in document retrieval and user feedback processing. In addition, we submitted a baseline retrieval method and a manual run that considers only relevant document...
A Query Algebra for Quantum Information Retrieval
The formalism of quantum physics is said to provide a sound basis for building a principled information retrieval framework. Such a framework is based on the notion of information need vector spaces, where events, such as document relevance, correspond to subspaces, and user information needs are represented as weighted sets of vectors. In this paper, we look at possible ways to build, using an...
Publication date: 1994